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Abstract 

The solution of large sparse linear systems is often the most time-consuming part 
of many science and engineering applications. Computational fluid dynamics, 
circuit simulation, power network analysis, and material science are just a few 
examples of the application areas in which large sparse linear systems need to be 
solved effectively. In this paper we introduce a new parallel hybrid sparse linear 
system solver for distributed memory architectures that contains both direct and 
iterative components. We show that by using our solver one can alleviate the 
drawbacks of direct and iterative solvers, achieving better scalability than with 
direct solvers and more robustness than with classical preconditioned iterative 
solvers. Comparisons to well-known direct and iterative solvers on a parallel 
architecture are provided. 

Keywords: sparse linear systems, parallel solvers, direct solvers, iterative 
solvers 



1. Introduction 

Many applications in science and engineering give rise to large sparse linear 
systems of equations. Some of these systems arise in the discretization of partial 
differential equations (PDEs) modeling various physical phenomena, such as in 
computational fluid dynamics , semiconductor device simulations, and material 
science. Large and sparse linear systems also arise in applications that are not 
governed by PDEs (e.g. power system networks, circuit simulation, and graph 
problems ). 

Numerical simulation processes often consist of many layers of computational 
loops (e.g. see Figure [T]). It is a well known fact that the cost of the solution 
process is almost always governed by the solution of the linear systems especially 
for large-scale problems. 

The emergence of multicore architectures and highly scalable platforms mo- 
tivates the development of novel algorithms and techniques that emphasize con- 
currency and are tolerant of deep memory hierarchies, as opposed to minimizing 
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Loop: Time Integration 

Loop: Nonlinear Iteration 
Loop: Linear Systems 

On parallel computing platforms; multicore to petascale 

architectures 
End e 

End r\ 

End At 



Figure 1: Target computational loop 



raw FLOP counts. While direct solvers are reliable, they are often memory- 
intensive for large problems and offer limited scalability. Iterative solvers, on 
the other hand, are more efficient but, in the absence of robust preconditioncrs, 
lack reliability. 

In this paper we introduce a parallel sparse linear system solver that is 
hybrid. We note that we are using the term "hybrid" to emphasize that our 
solver is using both direct and iterative techniques. We advocate that using our 
solver in hybrid mode one can alleviate the drawbacks of direct and iterative 
solvers, i.e. achieving more scalability than a direct solver and more robustness 
than a classical preconditioned iterative solver. 

The rest of this paper is organized as follows. In Section 2, we discuss 
background and related work. In Section 3, we give a description of the new 
algorithm and a simple example to demonstrate the details of the implemen- 
tation. In Section 4, we present variety of numerical experiments. Finally, we 
conclude the paper with discussions in Section 5. 



2. Background and related work 



Considerable effort has been spent on algebraic parallel sparse linear system 
solvers. Sparse linear system solvers are traditionally divided into two groups 
(i) direct solvers (ii) iterative solvers. In the first group some examples are 
MUMPS [lllli , Pardiso 0,0, and SuperLU 0. 

Iterative solvers mainly consist of classical preconditioned Krylov Subspace 
methods, and preconditioned Richardson iterations. Unlike direct sparse system 
solvers, iterative methods (with classical blackbox preconditioners) are not as 
robust. This is true even with the most recent advances in crea ting LU-based 
preconditioners 0, @, Q . Approximate inverse preconditioners 1 1 lL Tl2 , 
arc known to be more favorable for parallelism. 
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The Spike algorithm [U, EE ES EE EE EE E2 is a parallel solver for banded 
systems, that combines direct and iterative methods, is one of the first examples 
of hybrid linear system solvers. More recently in 22, 2^, 24 1, the Spike algo- 
rithm was used for solving banded systems involving the preconditioner that is 
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obtained after reordering the coefficient matrix with weights for sparse linear 
systems. 



3. Domain decomposing parallel solver 



We introduce a new parallel hybrid sparse linear system solver called Domain 
Decomposition Parallel Solver (DDPS) which can be used for solving sparse 
linear systems of equations: Ax = /. Recently, we have presented an algorithm 
that used incomplete lu factorization for the diagonal block and its application 
on fluid structure interaction problems pBl 26 1. In this paper we introduce 



DDPS that uses the direct solver Pardiso within each block and extend the 
results to general sparse systems from a variety of application areas. 

We arc motivated to create DDPS due to the fact that many applications use 
domain decomposition to distribute the work among the processors and the lack 
of reliability of black box preconditioned Krylov subspace methods and lack of 
scalability of direct solvers. METIS [13, S| is often used to partition the domain 
(and hence to partition the matrices). DDPS is similar to the Spike algorithm 
but unlike Spike it does not assume banded structure for the coefficient matrix 
A. Given a general sparse linear system Ax = /, we partition A G R nxn into p 
block rows A = [Ai , An, A p ] T . Let 



A = V + R, 



where T> consists of the p block diagonals of A, 
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and R consists of the remaining elements (i.e. R = A — V). Let Li and Ui be 
an incomplete LU factorizations of An where i = 1,2,..., p. We define 
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in which An = LiUi. 

The DDPS algorithm is shown in Figure [5] We assume the system Ax = f 
is the one after METIS reordering 

Stages 1-5 are considered as a preprocessing phase where the right hand side 
is not required. After preprocessing we solve the system via a Krylov subspace 
method and using a preconditioner. The major operations in a Krylov subspace 
method are: (i) matrix vector multiplications, (ii) inner products, and (hi) 
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Data: Ax = f and a partitioning information 
Result: x 

1. T> + R < — A for the given partitioning information; 

2. LiUi < — An (approximate or exact) for i = 1, 2, ...,p ; 

3. R < — R (by dropping some elements) ; 

4. G i — ±>- x R ; 

5. identify nonzero columns of G and store their indices in array c ; 

6. Solve Ax = / via a Krylov subspace method with a preconditioncr 
P = V + R and stopping tolerance e ut) 



Figure 2: DDPS algorithm. 



solve Pz = y 

{f)- l Pz = f)- x y => (7 + G)z = g) ; 

6.1 g < — f>- 1 y ; 

6.2 G <— (I(c,c) + G(c,c)); z «— z(c); g 4— 5 (c) ; 

6.3 solve the smaller independent system: Gz = p (directly or 
iteratively with stopping tolerance tin) ; 

6.4 z(c) < — z ; 

6.5 z < — g — Gz ; 

end 



Figure 3: Preconditioning operation: Pz = y 

preconditioning operations in the form of Pz = y (for some y) . Only the details 
of the preconditioning operations for DDPS are given in Figure [3] 

Each stage, with the exception of solving the reduced system, can be exe- 
cuted with perfect parallelism requiring no intcrprocessor communications. The 
solution of the smaller system Gz = g is the only part of the preconditioning op- 
eration that require communication. The size of G is problem and partitioning 
dependent and it is expected to have an influence on the overall scalability of 
the algorithm. The size of G is determined by the number of nonzero columns 
in G. We employ several techniques to reduce the size of G: 

• METIS reordering to reduce the total communication volume for given 
number of partitions and hence reducing the size of G by reducing the 
number of elements in R. (We note that METIS works on undirected 
graphs, therefore we apply METIS on (\A\ + \A T \)/2). 

• A dropping strategy: Given a tolerance S G [0, 1], if for any column k in 
Ri \\R{:, fc)i||oo < S x maxj | \R(:,j)i \ \oo (i = l,2,..,p) we do not consider 
that column when forming G. Here Ri is the block row partition of R 
(i.e. R = [i?i,i?2, ...,i? p ] T ). Another possibility is to drop elements after 
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computing G. In this paper, however, we only consider the former as the 
latter is expected to be computationally expensive. 

It is required the diagonal blocks, An, are nonsingular. In case they are 
singular, however, in addition to METIS applying HSL MC64 reordering and/or 
a diagonal perturbation can be considered. 

Notice that dropping elements from R in stage 3 to reduce the size of G re- 
sults in an approximation of the solution. Furthermore, we can use approximate 
LU-factorization of the diagonal blocks in stage 2 and solve Gx = g iteratively 
in stage 6.3. Therefore, we place an outer iterative layer where we use the 
above algorithm as a solver for systems involving the preconditioner P = T> + R 
where R consists only of the columns that are not dropped. We stop the outer 
iterations when the relative residual at the k th iteration 1 1 1 1 oo / 1 1 ro 1 1 oo < £out- 

DDPS is a direct solver if (i) nothing is dropped from R, (ii) exact LU 
factorization of An is computed, and (iii) Gz = g is solved exactly. In the case 
of using DDPS as a direct solver, an outer iterative scheme may not be required 
but recommended. In this paper we use the direct solver Pardiso for computing 
LU factorization of the diagonal blocks. 

The choices we make in stages 2,3, and 6.3, result in a solver that can be 
as robust as a direct solver or as scalable as an iterative solver, or anything in 
between. Notice that the outer iterative layer also benefits from our partitioning 
strategy as METIS reduces the total communication volume in parallel sparse 
matrix vector multiplications. 

We note that G consists of dense columns within each partition which we 
store as a two dimensional array in memory and as a result matrix vector mul- 
tiplications can be done via level 2 BLAS [S, EH (or level 3 in case of multiple 
right hand sides). 

In order to illustrate the steps of the basic DDPS algorithm (without any 
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Block diagonal matrix D is indicated in red color (or bold in black and white) 
for 3 partitions where each partition is of size 3. After prcmultiplying both sides 
with D^ 1 from left we obtain the modified system, (I + G)x = g (we do not 
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need to form D 1 explicitly to compute D 1 R) 
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We note that unknowns 1, 2, 5, and 9 form a smaller independent reduced 
system (indicated in blue color or bold in black and white) , 
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which has the solution [x\, X2, %5, xg] T = [—3.2389,3.4413, 
Finally, we can retrieve the solution of the system via 
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and obtain : 



-3.2389, 3.4413, 1.7766, -2.7063, -0.1151, 0.9405, 0.365, 0.5402, 1.5766] 



4. Numerical experiments 

The set of p roblems is obtained from the University of Florida Sparse Matrix 



Collection |32| . We choose the largest nonsymmctric matrix from each applica- 
tion domain. The list of the matrices and their properties are given in Table [TJ 
For each matrix we generate the corresponding right hand-side using a solution 
vector of all ones to ensure that / £ span(A). All numerical experiments are 
performed on an Intel Xeon (X5560@2.8GHz) cluster with Infiband intercon- 
nection and 16GB memory per node. The number of MPI processes is equal 
to the number of cores used and is also equal to the number of partitions for 
DDPS. 
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In the following numerical experiments, we use a variation of precondi- 
tioned BiCGStab [29[ as the outer iterative solver. The smaller reduced sys- 
tem Gz = g is also solved iteratively via BiCGStab without preconditioning. 
For the iterative solvers, the outer iterations are terminated when the number 
of iterations reaches 1,000 or the relative residual meets the stopping criterion 
(||/ - Ax\ loo/] |/| |oo < 10~ 5 ). Failures of the solvers are indicated by Fl or F2 
when the solver runs out of memory or the final relative residual is larger than 
10 -5 , respectively. We limit the maximum number of iterations to 100 and the 
stopping tolerance to t - m = 10~ 4 for the inner iterations of DDPS solver. 

We use ILUPACK with the following parameters, reorderings: weighted 
matching and AMD [33|, droptol: 10 _1 , estimate for the condition numbers 
of the factors: 50, and an elbow space of 10 which are recommended by the 
user guide for general sparse linear systems. ILUPACK uses GMRES(30) with 
a variation of incomplete LU factorization based preconditioner. MUMPS and 
Pardiso has been used with their default parameters and using METIS reorder- 
ing. 

In TableHwe present the total solve time for MUMPS, Pardiso , DDPS(c5 = 
0.9) and ILUPACK. For 5 systems (indicated by blue) out of 9, DDPS is faster 
than MUMPS (for 16 MPI processors). In addition, DDPS is more robust than 
ILUPACK and almost as robust as MUMPS direct solver, using 16 partitions 
DDPS fails only in 2 cases while ILUPACK and MUMPS fails in 5 and 1 case, re- 
spectively. DDPS never runs out of memory while MUMPS runs out of memory 
for one of the problems unless more than 8 partitions are used. 

The speedup with respect to Pardiso solver using a single core is given in 
Table [3] We note that two problems achieve superlinear speed improvement 
due to cache effects. 

In Tabled the number of outer BiCGStab iterations for DDPS is provided as 
one increases the number of partitions. With the exception of two cases, namely 
hvdc2 and thermomech_dk, the number of iterations depends weakly (less than 
linear) on the number of partitions (or MPI processes) . 

The average number of inner BiCGStab iterations is given in Table [5] Since 
we make sure the reduced system size is small via various techniques described 
earlier, number of iterations are relatively small for all systems with weak de- 
pendence on number of processes. 

In Table [5] we show the effect of varying the drop tolerance, 5, while the 
number of partitions is fixed at 16. A small 5 results in a variation of DDPS 
that is more like a direct solver. Although this causes the number of iterations 
to decrease, it also increases the memory requirement and the solver runs out of 
memory For small S memory problem appears in two cases, namely rajatSl and 
atmosmodl. In 5 cases the number of outer iterations decrease as we decrease 
S. In the remaining two cases the DDPS solver failed even though S is set to be 
a small number. 
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5. Conclusion 

Wc have introduced a new hybrid sparse linear system solver called DDPS. 
We have shown that our new sparse linear system solver is often faster than 
direct solvers and more robust than classical preconditioned Krylov subspacc 
methods. DDPS is flexible as it can be used in a variety of configurations. 
Depending on the solver for the diagonal blocks a new variation of the algorithm 
will arise. The choice we make for solving the inner reduced system further 
increases the number of possibilities. Although we have used METIS to show 
the application of the algorithm on general sparse systems, DDPS algorithm 
is ideally suited for problems in which the matrices are already distributed via 
domain decomposition to minimize interprocessor communication. 
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Table 1: Linear systems from the University of Florida Sparse Matrix Collection, n 
the degree of diagonal dominance, respectively 



and dd stands for matrix size , number of nonzcros, and 
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1,489,752 


10,319,760 









computational fluid dynamics 
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power network 
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directed weighted graph 
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semiconductor device simulation 
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electromagnetic 
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Tabic 2: Total solve times (in seconds) for MUMPS, Pardiso, and DDPS, and ILUPACK 



MPI Processes 
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1 


2 
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Pardiso 
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DDPS 
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16 


ILUPACK 
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Fl 
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Fl 


Fl 


171.3 


1291.0 


391.6 


781.0 


149.1 


100.7 


13.6 
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1.5 


1.6 


1.4 
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1.9 


2.0 


1.4 


1.6 


6.9 


F2 
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LANGUAGE 
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273.9 


F2 


F2 


F2 


1191.3 


124.7 


15.2 


6.4 


2.0 


3.4 
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42.5 


27.3 


19.3 


13.2 


8.5 


43.9 


21.2 


9.4 


F2 


F2 


F2 
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78.3 


67.6 


59.3 


54.1 


53.7 


57.9 
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45.1 


F2 


THERMO 
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11.2 
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14.5 


12.1 


10.3 


9.8 


9.7 


10.8 


170.7 


140.2 


99.0 


77.8 


F2 


TORS03 


40.2 


26.3 


18.2 


12.4 


9.6 


49.4 


20.6 


10.0 


4.0 


2.1 


2.2 


XENON2 


13.5 


8.2 


6.1 


4.5 


4.2 


14.7 


14.9 


7.6 


3.9 


2.9 


F2 
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Table 3: Speedup of DDPS compared to Pardiso 





Pardiso 


DDPS 








MPI Processes 


1 


2 


4 


8 


16 


ATMOSMODL 


1.0 


3.3 


1.7 


8.7 


12.8 


HVDC2 


1.0 


1.4 


1.2 


0.3 


F2 


LANGUAGE 


1.0 


9.6 


78.2 


185.7 


609.4 


0HNE2 


1.0 


2.1 


4.8 


F2 


F2 


RAJAT31 


1.0 


0.2 


0.4 


0.6 


1.3 


THERMO 


1.0 


1.5 


0.3 


0.2 


0.3 


TMT.UNSYM 


1.0 


0.1 


0.1 


0.1 


0.1 


TORS03 


1.0 


2.4 


4.9 


12.2 


23.6 


XENON2 


1.0 


1.0 


1.9 


3.7 


5.1 



Table 4: Number of outer BiCGStab iterations for DDPS and ILUPACK 



MPI Processes 


DDPS 
2 


4 


8 


16 


ILUPACK 
1 


ATMOSMODL 


18 


18 


21.5 


21.5 


26 


HVDC2 


0.5 


12.5 


260 


F2 


F2 


LANGUAGE 


5 


7 


6 


6 


4 


OHNE2 


0.5 


0.5 


F2 


F2 


F2 


RAJAT31 


71.5 


86.5 


106.5 


99 


F2 


THERMO 


84.5 


248.5 


752.5 


856 


31 


TMT.UNSYM 


89.5 


192 


212 


294 


F2 


TORS03 


8.5 


10 


8.5 


8.5 


5 


XENON2 


53 


63 


67 


90.5 


F2 



Table 5: Average number of inner BiCGStab iterations for DDPS 



MPI Processes 


2 


4 


8 


16 


ATMOSMODL 


0.5 


3.28 


0.5 


4.74 


HVDC2 


0.5 


3.12 


14.93 


F2 


LANGUAGE 


0.5 


0.5 


0.92 


1 


OHNE2 


3.5 


3.5 


F2 


F2 


RAJAT31 


3.54 


3.16 


7.18 


4.95 


THERMO 


1 


1 


2.93 


3.7 


TMT.UNSYM 


13.2 


3.32 


12.14 


15.32 


TORS03 


4.32 


2.9 


3.35 


4.53 


XENON2 


1 


1 


1 


4.7 
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Tabic 6: Number of outer BiCGStab iterations for DDPS for 16 MPI processes 



s 


0.99 


0.9 


0.6 


0.3 


0.1 


1.0E-5 


ATMOSMODL 


23.5 


21.5 


19 


Fl 


Fl 


Fl 


HVDC2 


F2 


F2 


F2 


F2 


F2 


8 


LANGUAGE 


7 


6 


6 


4 


2.5 


1 


0HNE2 


F2 


F2 


F2 


F2 


F2 


F2 


RAJAT31 


99 


99 


99 


Fl 


Fl 


Fl 


THERMOMECH_DK 


645.5 


856 


F2 


F2 


F2 


414 


TMTJJNSYM 


294 


294 


231 


F2 


F2 


F2 


T0RS03 


8.5 


8.5 


8.5 


6.5 


4 


1 


XEN0N2 


99 


90.5 


105.5 


F2 


F2 


1 



